Today, we will return to exploring Russians’ support the war in Ukraine using a public opinion survey from Russia conducted by Alexei Miniailo’s “Do Russians Want War” project.
The survey was conducted by phone using a random sample of mobile phone numbers to produce a sample of respondents representative of the population in terms of age, sex, and geography. It was in the field from February 28 to March 2.
First, we will explore how support for the war varies with the
demographic predictors age, sex and
education. We will compare the results of modeling this
relationship using Ordinary Least Squares regression
and Logisitic Regression
We’ll talk more about the technical aspects of logistic regression next week. For today we’ll simply compare the results from these two approaches.
Next, we will gain insight into how our estimates from these models might vary using the statistical process of bootstrapping. Specifically, we will simulate the idea of repeated sampling that is the foundation of frequentist interpretations of probability, by sampling from our sample with replacement.
We’ll walk through the mechanics of simulation together. Then you’ll quantify the uncertainty described by these bootstrapped sampling distributions.
Finally, we’ll see what other questions we might ask of these data and practice various skills we’ve developed through out the course.
Plan to spend the following amount of time on each section
Get set up to work (5 minutes)
Model the relationship between demographic predictors and war support using OLS and Logistic regression (20 minutes)
Assess the uncertainty around your estimated coefficients (15 minutes)
Quantify the uncertainy described by your sampling distributions (10 minutes)
Explore other relationships in the data. (30 minutes)
You will work in your assigned groups. Only one member of each group needs to submit the html file produced by knitting the lab.
This lab must contain the names of the group members in attendance.
If you are attending remotely, you will submit your labs individually.
You can find your assigned groups in previous labs
Today’s lab covers a little bit of everything. You will
learn how to model binary outcomes using logistic regression
develop an intuition for how to think about uncertainty using simulations to describe what might have happened
practice formulating, estimating, presenting and interpreting regression models.
As with every lab, you should:
author: section of the YAML header
to include the names of your group members in attendance.First lets load the libraries we’ll need for today.
the_packages <- c(
## R Markdown
"kableExtra","DT","texreg","htmltools",
"flair", # Comments only
## Tidyverse
"tidyverse", "lubridate", "forcats", "haven", "labelled",
"modelr", "purrr",
## Extensions for ggplot
"ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr",
"GGally", "scales", "dagitty", "ggdag", "ggforce",
# Data
"COVID19","maps","mapdata","qss","tidycensus", "dataverse",
# Analysis
"DeclareDesign", "boot"
)
# Define function to load packages
ipak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
ipak(the_packages)
kableExtra DT texreg htmltools flair
TRUE TRUE TRUE TRUE TRUE
tidyverse lubridate forcats haven labelled
TRUE TRUE TRUE TRUE TRUE
modelr purrr ggmap ggrepel ggridges
TRUE TRUE TRUE TRUE TRUE
ggthemes ggpubr GGally scales dagitty
TRUE TRUE TRUE TRUE TRUE
ggdag ggforce COVID19 maps mapdata
TRUE TRUE TRUE TRUE TRUE
qss tidycensus dataverse DeclareDesign boot
TRUE TRUE TRUE TRUE TRUE
There are three packages in particular that we’ll need to maker sure are installed and loaded
modelrpurrrbroomIf ipak didn’t return TRUE for each of
these, please uncomment and run:
# install.packages("modelr")
# install.packages("purrr")
# install.packages("broom")
Next we’ll load the recoded data for the lab
Our primary outcome of interest (dependent variable) for today is a binary measure of support for war:
support_war01 “Please tell me, do you support or do not
support Russia’s military actions on the territory of Ukraine?” (1=yes,
0 = no)Our key predictors (independent variables) are the following demographic variables:
age “How old are you?”
sex “Gender of respondent” (As assessed by the
interviewer)
education_n “What is your highest level of education
(confirmed by a diploma, certificate)?” (1 = Primary school, 2 = “High
School”, 3 = “Vocational School” 4 = “College”, 5 = Graduate Degree)1
load(url("https://pols1600.paultesta.org/files/data/df_drww.rda"))
Please estimate the following models:
m1 using
lm()m2 using
glm() with family=binomial# OLS
# m1 <- ???
# Logisitic
# m2 <- ???
Next, please display the results of your regressions in a table using
htmlreg()
# Regression Table
The coefficients from a logistic regression aren’t easy to directly interpret.
Instead, we will produce predicted values for each model
To do this, we will need to create a prediction
dataframe called pred_df Every variable in your
model, needs to be represented in your prediction data frame.
Use expand_grid() to create a data frame where
age varies from 18 to 99sex is held constant at “Female”education_n is held constant at its mean## Create prediction data frame
# pred_df <- expand_grid(
# age = ??? : ???,
# ??? = ???,
# ???
# )
Then you use the predict() function to produce predicted
values from each model.
Save the output of predict() for m1 to a
new column in pred_df called pred_ols.
For m2 you need to tell are to transform the predictions
from m2 back into the units of the
response (outcome) variable, by setting the argument
type = "response". Save the output of
predict() for m1 to a new column in pred_df
called pred_logit.
# #Predicted values for m1
# pred_df$pred_ols <- predict(???,
# newdata = ???)
# #Predicted values for m2
# #Remember to add type = "response"
# pred_df$pred_logit <- ???
Now we can compare the predictions of OLS and Logistic regression by plotting the predicted values of support for the war from each model.
To produce this plot you’ll need to
data (you want to use the values from
pred_df)aesthetics in
ggplot
age on the x axis and and pred_ols on
the y-axisgeometries to plot
geom_line() to the plotgeom_line())aes of
y=pred_logit# data %>%
# aesthetics with ggplot()
# geometries geom_line()
How do the predictions of the two models compare
So the predictions from OLS produce impossible values (levels of support above 100 percent) at for very old respondents, while the predictions from logistic regression are constrained to be between 0 and 1.
If we think that logistic regression provides a more credible way of modeling support for the war, then our OLS regression appears to overstate the level of support among young and old, while possibly understating the level of support among the middle age. The differences aren’t huge – a few percentage points – but for a binary outcome we will often prefer to model it with logistic regression.
Also note that marginal effect for age in the logistic regression is not constant. An increase in age from 25 to 26 is associated with a larger increase in support, than an increase in age from 75 to 76.
How confident are we that the true relationship between age and support for the war is positive? How different might the coefficient be if we had taken a different random sample of Russians in early March 2022?
In this section, we will explore how we can simulate the idea “repeated” sampling using just the sample we have to quantify uncertainty about the estimates in our model.
To do this we will.
df_drww
with replacement using the bootstrap
function from the modelr package.map function from the purrr
package.Put the coefficients from this bootstrapped model into into a
tidy data frame, using the tidy function
from the broom package, and the map_df
function from the purrr package.
Visualize the sampling distribution for coefficients in our model.
In the code below I demonstrate this process for m1.
Then you will repeat this process for m2
df_drwwBelow we create 1,000 boostrapped samples
# Make sure these packages are loaded
library(modelr)
library(purrr)
library(broom)
# Set random seed for reproducability
set.seed(1234)
# 1,000 bootstrap samples
boot <- modelr::bootstrap(df_drww, 1000)
Let’s take a moment to understand what boot is and why
we’re sampling with replacement.
The object boot contains 1,000 bootstrapped samples from
df_drww.
If we look at the first bootstrap we see:
boot$strap[[1]]
<resample [1,807 x 42]> 1308, 1018, 1125, 1004, 623, 905, 645, 934, 400, 900, ...
The numbers 1308, 1018, 1125, 1004, 623, 905, ...
correspond to rows in df_drww. So person 1308
is the first observation in this boot strap sample, then person
1018 and so on.
Because we are sampling with replacement
observations from df_drww can appear multiple times. In our
first bootstrap sample:
table(table(boot$strap[[1]]$idx))
1 2 3 4 5 6
666 342 104 27 5 2
Why would we want to sample with replacement?
Well, what we’d really love are 1,000 separate random surveys all drawn from the same population in the same way.
Since that’s not feasible, we instead use the one sample we do have to learn things like how much our estimate might vary in repeated sampling. Efron (1979) called this procedure “bootstrapping” after the idiom “to pull oneself up by one’s own bootstraps”
We do this by sampling from our sample with replacement.
When we sample with replacement, we are sampling from our sample, as our sample was sampled from the population.
With replacement means that some observations will appear multiple times in our bootstrapped sample (while others will not be included at all).
When an observation appears multiple times in a bootstrap sample, conceptually, we’re using that original observation to represent the other people like that observation in the population who – had we taken a different sample – might have been included in our data.
Note each bootstrap sample is a different random sample with replacement. In our second bootstrap sample, one observation (person 1496) appeared five times.
table(table(boot$strap[[2]]$idx))
1 2 3 4 5 6
661 342 105 31 1 3
# Person 406
sum(boot$strap[[2]]$idx == 1496)
[1] 5
# Person 1 is not in boostrap 2
sum(boot$strap[[2]]$idx == 1)
[1] 0
Now let’s estimate our model for each bootstrapped sample, using the
map function.
boot, map
estimates the model
lm(support_war01 ~ age + sex + education_n) plugging in the
bootstrap sample into the data=..# bootstrap simulations
bs_ols <- purrr::map(boot$strap, ~ lm(support_war01 ~ age + sex + education_n, data =.))
The end result is a large list with 1,000 separate linear regression models estimated on each bootstrapped sample.
The coefficients from each bootstrap vary from one simulation
# First bootstrap
bs_ols[[1]]
Call:
lm(formula = support_war01 ~ age + sex + education_n, data = .)
Coefficients:
(Intercept) age sexMale education_n
0.305019 0.009746 0.081039 -0.024142
to the next:
# Second boostrap
bs_ols[[2]]
Call:
lm(formula = support_war01 ~ age + sex + education_n, data = .)
Coefficients:
(Intercept) age sexMale education_n
0.245000 0.009142 0.095631 -0.009285
Because they’re estimated off of different bootstrapped samples.
We will visualize and quantify that variation to describe the uncertainty associated with our estimates.
But first, we need to transform our large list of linear models, into a more tidy data frame that’s easier to manipulate.
In the code below we transform this large list of models into a
tidy data frame of coefficients.
# Tidy bootstrapp sims
bs_ols_df <- map_df(bs_ols, tidy, .id = "id")
In the resulting data frame the id variable identifies
the bootstrap simulation (1 to 1,000), the term variable
indentifies the specific coefficient from the model estimated for that
simulation.
head(bs_ols_df)
# A tibble: 6 × 6
id term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 0.305 0.0522 5.84 6.29e- 9
2 1 age 0.00975 0.000702 13.9 3.31e-41
3 1 sexMale 0.0810 0.0222 3.65 2.73e- 4
4 1 education_n -0.0241 0.0107 -2.26 2.39e- 2
5 2 (Intercept) 0.245 0.0533 4.60 4.61e- 6
6 2 age 0.00914 0.000731 12.5 3.36e-34
Finally, let’s get a sense of how our coefficients could vary.
Specifically, let’s compare the the observed coefficient from
m1 for age, to the bootstrapped
sampling distribution of coefficients in
bs_ols_df
p_ols_age that
shows the distribution of the coefficients for age from our
simulationp_ols_age <- bs_ols_df %>%
filter(term == "age") %>%
ggplot(aes(estimate))+
geom_density()
p_ols_age
Next we’ll add some additional geometries and labels to our figure
p_ols_age +
geom_rug() -> p_ols_age
p_ols_age
agep_ols_age +
geom_vline(xintercept = coef(m1)[2],
linetype = 2) -> p_ols_age
Error in coef(m1): object 'm1' not found
p_ols_age
p_ols_age +
theme_bw()+
labs(
x = "Age",
y = "",
title = "Bootstrapped Sampling Distribution of Age Coefficient"
) -> p_ols_age
p_ols_age
Congratulations you’ve just produced and visualized your first bootstrapped sampling distribution!
Conceptually, this distribution describes *how much we would expect the coefficient on age in model to vary** from sample to sample.
Just from eyeballing the figure above, it looks like the observed the relationship between age and support for war or 0.009 could be about as high as 0.011, and as low as 0.007.
Of course, as the budding quantitative social scientists that we are, we can do better than just eyeballing the data.
Specifically, please calculate the standard deviations, 97.5 and 2.5
percentiles of each coefficient in bs_old_df
You’ll want to
group_by() the term variable in
bs_ols_dfsummarize() the output of functions applied to the
estimate column in bs_ols_df# calculate the standard deviations, 97.5 and 2.5 percentiles
Please compare these these estimates to those by the following functions
# summary(m1)
# confint(m1)
Finally, let’s take some time to explore other variables and relationships in the data.
Please do the following:
Research Question: Formulate a brief research question here
Data: Identify the relevant outcome and predictor variables in the data.
# Explore the data
# Data transformations if necessary
\[\text{Outcome} = \beta_0 + \beta_1 \text{Key Predictor} + \beta_2 \text{Covariate}\]
Expectations Discuss the expected sign and significance of the coefficients in your model. If you think a coefficient could be either positive or negative, explain what each of these results means in the context of your question.
Estimation Estimate a model or models to test your question.
# Models
# Regression table
# Figures
I think, google translate was a bit unclear. But higher numbers equal more education.↩︎